Apparatus And A Method For Audio Signal Processing

ABSTRACT

An apparatus, such as a headset, configured to process audio signals from multiple microphones, comprising: a first pair of microphones (101, 102) outputting a first pair of microphone signals and a second pair of microphones (103, 104) outputting a second pair of microphone signals; a first beamformer (105) and a second beamformer (106) each configured to receive a pair of microphone signals and adapt the spatial sensitivity of a respective pair of microphones as measured in a respective beamformed signal (X_(L); X_(R)) output from a respective beamformer (105; 106); wherein the spatial sensitivity is adapted to suppress noise relative to a desired signal; a third beamformer (107) configured to dynamically combine the signals (X_(L); X_(R)) output from the first beamformer (105) and the second beamformer (106) into a combined signal (X_(C)); wherein the signals are combined such that signal energy in the combined signal is minimized while a desired signal is preserved; and a noise reduction unit (109) configured to process the combined signal (X_(C)) from the third beamformer (107) and output the combined signal such that noise is reduced.

It has been discovered that the use of multiple microphones and the use of beamforming techniques provide audio signal reproduction that is superior to single-microphone or non-beamforming systems. The multiple microphones are located at different positions and allow so-called spatial sampling, which in turn enables cancelling of noise interfering with a desired signal such as a person's voice; this is also known as beamforming, spatial filtering or noise-cancelling. Subsequent time-varying post-filters are often applied as a means to further discriminate the person's voice from (background) noise signals.

Multiple microphones and the use of beamforming techniques are frequently embodied in headsets, hearing aids, laptop computers and other electronic consumer devices.

The technical field of beamformers has been extensively researched; however their qualities and configurations have not been fully exploited.

RELATED PRIOR ART

US 2012/0020485 discloses an audio signal processing method which estimates a first indication of a direction of arrival, relative to a first pair of microphones, of a first sound component received by the first pair of microphones; and estimates a second indication of a direction of arrival, relative to a second pair of microphones, of a second sound component received by the second pair of microphones. The first and the second pair of microphones are arranged at respective sides of a person's head during normal operation of a device using the method. The method also involves controlling gain of an audio signal to produce an output signal, based on the first and second direction indications.

SUMMARY

There is provided an apparatus, such as a headset, configured to process audio signals from multiple microphones, comprising: a first pair of microphones outputting a first pair of microphone signals and a second pair of microphones outputting a second pair of microphone signals; wherein the first pair of microphones are arranged with a first mutual distance and the second pair of microphones are arranged with a second mutual distance, and wherein the first pair of microphones are arranged at a distance from the second pair of microphones that is greater than the first mutual distance and the second mutual distance at least when the apparatus is in normal operation; a first beamformer and a second beamformer each configured to receive a pair of microphone signals and adapt the spatial sensitivity of a respective pair of microphones as measured in a respective beamformed signal output from a respective beamformer; wherein the spatial sensitivity is adapted to suppress noise relative to a desired signal; a third beamformer configured to dynamically combine the signals output from the first beamformer and the second beamformer into a combined signal; wherein the signals are combined such that noise energy in the combined signal is minimized while a desired signal is preserved; and a noise reduction unit configured to process the combined signal from the third beamformer and output the combined signal such that noise is reduced.

Thus, beamforming is provided in a first beamforming stage with the first beamformer and the second beamformer processing the microphone signals and in a second stage with a third beamformer processing signals output from the first stage. The first beamforming stage serves to enhance or emphasize the desired signal locally with respect to the microphone pairs by adapting the spatial sensitivity of a respective microphone pair. The spatial sensitivity is adapted, e.g., by adjusting beamformer coefficients to control the spatial configuration of the beamformer nulls which may comprise adjusting beamformer coefficients such that the beamformer obtains an omni-directional characteristic, which is useful to avoid amplification of uncorrelated (between microphones) noise such as wind noise. The effectiveness of the first beamforming stage depends on the assumption that the microphones of each microphone pair are situated closely to one another (for reasons explained below).

In addition to such local optimization in capturing a desired signal, the level of the noise component may vary considerably between the first and second beamformed signals. This may be due to different levels at the microphones, e.g., wind turbulence is a highly local phenomenon, and acoustic shadowing effects from the user's head in a head worn device. Furthermore, the first and the second beamformers may not be able to cancel the noise equally well, depending on the relative position of the microphone pair, the signal of interest and interfering noises.

The third beamformer is thus configured to receive signals that have already been subject to local optimization by the first stage beamformers whereby the desired signal is isolated as far as possible. By dynamically combining signals from the left-hand side and the right-hand side, it is possible to select or emphasize a spatially controlled signal from the most favourably positioned microphone pair.

Processing microphone signals in this way improves the effect of noise suppression by the noise reduction unit when, as claimed, it is configured to process the combined signal from the third beamformer. This is partly ascribed to the observation that desired signals stand out more clearly after such two-stage beamforming, which makes noise suppression more effective. Furthermore, the two-stage beamformer approach achieves the combined benefit of beamforming on microphones that are closely spaced and microphones that are not closely spaced using well known dual-microphone beamformers. The third beamformer may combine its input signals by linear or non-linear weighting of the input signals.

The apparatus, such as a headset, a hearing aid or another apparatus picking up audio signals by means of microphones may be configured to be worn by a person with the first pair of microphones arranged on a left-hand side of a person's head and the second pair of microphones arranged on the right-hand side of the person's head. Typically, the two pairs of microphones are sitting on an ear-cup of a headphone, a spectacle frame or booms or other protrusions at respective sides of a person's head. The microphones are arranged, at least approximately, in a so-called end-fire configuration. The microphones may alternatively or additionally be arranged in a broadside configuration.

By arranging the microphones, such that intra-pair microphones sit closer than inter-pair microphones at least when the apparatus is in normal operation and intra-pairs in end-fire configurations, the first and the second beamformer can take advantage of any near-field effect to cancel or suppress more noise at low frequencies and in addition make it possible to cancel more noise at higher frequencies, avoiding spatial aliasing. Additionally, the third beamformer can take advantage of the different local noise levels that the different pairs of microphones are exposed to. When the microphone pairs sit on different sides of a person's head, the head may form a wind and/or sound shadow reducing noise level on one side of the person's head. It is a major advantage of the invention that the highly complex problem of designing a single adaptive beamformer operating on all microphone inputs is decomposed into three simple, robust, well-understood dual-microphone beamformers.

In general, different types of microphones with different characteristics may be selected.

A desired signal is a signal that typically represents voice from a speaker within proximity of the microphones or voice appearing from a certain direction relative to the orientation of the microphones. A desired signal may be characterised by being emitted from one or more sound sources having predefined spatial locations with respect to the spatial location of the microphones. Since multiple microphones are used to pick up the desired signal, the desired signal may be characterised by a predefined phase and/or amplitude difference among the microphone signals and/or among beamformed signals. A desired signal may also be characterised by a predefined temporal characteristic and/or a predefined phase-/amplitude-frequency characteristic.

A noise signal or simply noise may include turbulence sounds induced by wind occurring at sufficiently high wind speeds and acting on the microphone membranes. Noise may also include background sounds such as tones from machines, sounds from items rattling or chinking, sounds from people talking amongst each other, etc. In some definitions, noise is characterised by being emitted from one or more sound sources that are located at other locations than the desired signal.

The first beamformer and the second beamformer adapt the directional sensitivity gradually or in steps e.g. comprising sensitivities that are at least approximated from the group of the following characteristics: Omni-directional, bi-directional, cardioid, subcardioid, hypercardioid, supercardioid or shotgun. The directional sensitivity may be changed gradually between an omni-directional, a bi-directional and a cardioid characteristic. The first beamformer may be configured as disclosed in WO 2009/132646 which is hereby incorporated by reference for everything disclosed in connection with especially FIG. 1 thereof.

The third beamformer may combine the signals from the first and the second beamformer in accordance with coefficients estimated from noise powers. In case the noise power of the signal from the first beamformer is higher than the noise power of the signal from the second beamformer, the signal from the second beamformer is weighted higher than the signal from the first beamformer and vice versa. The noise level of a signal may be estimated when voice is detected as not present.

The first mutual distance between the microphones of the first pair and the second mutual distance between the microphones of the second pair are shorter than the minimum wavelength of interest in the case of end-fire pairs, depending on the desired directional sensitivity. At and above frequencies with a shorter wavelength than the wavelength of interest, the ability to suppress or cancel noise will diminish due to the effect of spatial aliasing. The distance between the microphone pairs may correspond to the straight-line distance between a person's two ears, which may be about 18-22 cm. The first mutual distance and the second mutual distance may be about 10, 20, or 40 mm for a bandwidth of interest up to 4 kHz.
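By way of a non-limiting illustration, the spacing condition above may be checked numerically. The following is a minimal sketch that computes the shortest wavelength within a 4 kHz bandwidth and compares it with the example spacings. The speed of sound (343 m/s, roughly room temperature) and the function names are assumptions made for this illustration only.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def min_wavelength_mm(max_freq_hz: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Shortest wavelength (in mm) within a band extending up to max_freq_hz."""
    return 1000.0 * c / max_freq_hz


def spacing_ok(spacing_mm: float, max_freq_hz: float) -> bool:
    """True if the intra-pair spacing is shorter than the shortest wavelength."""
    return spacing_mm < min_wavelength_mm(max_freq_hz)


if __name__ == "__main__":
    # Example spacings from the text, checked for a 4 kHz bandwidth.
    for spacing in (10.0, 20.0, 40.0):
        print(spacing, spacing_ok(spacing, 4000.0))
```

Under the assumed speed of sound, the shortest wavelength at 4 kHz is somewhat above 85 mm, so each of the example spacings satisfies the condition.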

In general, the apparatus may perform signal processing in a time-domain or in a time-frequency-domain. In the latter case, time-to-frequency transformations are performed on signal blocks of a predefined duration on a running basis. In the time-frequency-domain signals are represented as time-domain samples in a number of frequency bins. Correspondingly, frequency-to-time reconstruction is performed on signals processed in the time-frequency-domain.

In some embodiments the noise reduction unit is configured to perform noise suppression on the combined signal from the third beamformer in response to a noise suppression coefficient; and the noise suppression coefficient is estimated from the microphone signals and/or a beamformed signal.

The noise suppression coefficient may comprise a first coefficient estimated from the first set of microphone signals and from a/the beamformed signal. The noise suppression coefficient may alternatively or additionally comprise a second coefficient estimated from the second set of microphone signals and from a/the beamformed signal. The noise suppression coefficient may be combined from the first and the second coefficient.

The noise suppression coefficient may be a gain factor of a multiplier in a time-frequency domain or a filter coefficient of a time-domain filter.

In some embodiments the apparatus comprises: a first control branch synthesizing a first noise suppression gain from the first pair of microphone signals and/or the first beamformer; a second control branch synthesizing a second noise suppression gain from the second pair of microphone signals and/or the second beamformer; and a selector configured to dynamically select and/or output the first noise suppression gain or the second noise suppression gain; wherein the noise reduction unit is configured to process the combined signal from the third beamformer in response to the selected and/or output noise suppression gain from the selector.

Thereby it is possible to dynamically select the first or the second noise suppression gain such that it is in accordance with signal quality measures estimated from respective beamformed signal output from a respective beamformer and respective noise suppression gains. This is expedient since the first and the second noise reduction gains may be computed under conditions which are not equally favourable. As a consequence, the noise may not be suppressed equally well and/or the desired signal may not be preserved equally well. For example, the mechanism for computing the first noise suppression gain may have access to signals which lend themselves to easier discrimination of the noise and the desired signal. This condition may arise from the situation where noise is less powerful at the input to the first beamformer due to a user's head shadow causing less wind noise or background noise. The condition may also arise from the situation where the spatial cues employed by the first noise suppression computation are more discriminative.

A hysteresis or threshold may be applied and used as a criterion on whether to enable the selector or not. Thereby it is possible to disable switching when an estimated noise level is below a predefined hysteresis or threshold. The hysteresis or threshold may be in the range of about 1 dB to about 3 dB. Thereby, it is possible to strike a trade-off between (1) achieving the lowest output noise level and (2) minimizing distortion of a desired signal such as a voice signal.
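One possible reading of the hysteresis criterion may be sketched as follows. In this sketch, which is an assumption and not the only possible implementation, the selector keeps its current side unless the estimated noise power of the other side is lower by more than the hysteresis. The names `power_db` and `select_side` and the default of 2 dB are hypothetical.

```python
import math


def power_db(power: float, floor: float = 1e-12) -> float:
    """Convert a linear power value to dB, guarding against log(0)."""
    return 10.0 * math.log10(max(power, floor))


def select_side(current: str, noise_l: float, noise_r: float,
                hysteresis_db: float = 2.0) -> str:
    """Return 'L' or 'R'; switch only if the other side is quieter
    than the current side by more than hysteresis_db."""
    diff_db = power_db(noise_l) - power_db(noise_r)  # > 0: right is quieter
    if current == "L" and diff_db > hysteresis_db:
        return "R"
    if current == "R" and diff_db < -hysteresis_db:
        return "L"
    return current
```

With such a rule, small fluctuations in the estimated noise levels do not cause repeated switching between the first and the second noise suppression gain.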

In some embodiments the selector is configured to operate in response to a first signal quality indicator and a second signal quality indicator; the signal quality indicators are synthesized from a respective beamformed signal processed to reduce noise in response to respective noise reduction gains.

In terms of noise suppression, an important aspect of signal quality is signal-to-noise ratio. As an example, with reference to FIG. 2, when using the beamformed, noise reduced signals as input to Signal Quality Evaluation, signal-to-noise ratio is influenced through X_(L) and X_(R). For example, if the signal-to-noise ratio of X_(L) is greater than that of X_(R), in cases where A_(L) and A_(R) reduce the noise component by the same factor, the signal-to-noise ratio of A_(L)X_(L) will be higher than that of A_(R)X_(R).

Furthermore, the Signal Quality Evaluation is influenced by the qualities of A_(L) and A_(R). In some cases, speech is more easily distinguishable from noise at one side of the head. A reason is that a user's head may shield the microphones from wind on a lee side of the user's head. Another reason is that the spatial cues employed by the noise suppression computation may be discriminated more clearly on the lee side of the user's head.

The signal quality indicators P_(L); P_(R), may be computed from the mean-squared product of the respective noise reduction gains, A_(L); A_(R), and the respective beam-formed signals X_(L); X_(R). The signal quality indicators may be computed per frequency band or accumulated across all frequency bands.
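As a non-limiting sketch, a signal quality indicator of this kind may be computed for one side as the mean of the squared magnitude of the product of gain and beamformed value. The function name and the representation of the inputs as plain sequences (one value per frame for a single frequency band, or values gathered across bands) are assumptions for illustration.

```python
def quality_indicator(gains, beamformed):
    """Mean-squared product of noise reduction gains A and beamformed values X.

    gains: sequence of real gains (A_L or A_R).
    beamformed: sequence of real or complex values (X_L or X_R), same length.
    """
    if len(gains) != len(beamformed) or not gains:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a * x) ** 2 for a, x in zip(gains, beamformed)) / len(gains)
```

Calling the function once with (A_(L), X_(L)) and once with (A_(R), X_(R)) yields P_(L) and P_(R), which may then be compared by the selector.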

In some embodiments a beamformed signal, processed to reduce noise in response to respective noise reduction gains, is input to an evaluator that is configured to output a control signal to the selector and thereby control selection; and the evaluator evaluates the beamformed signal, processed to reduce noise in response to respective noise reduction gains, according to a criterion of least power during a time interval when voice activity is detected as not present.

Thereby, the selection of respective noise suppression gains can be performed from an evaluation of the noise conditions (e.g. noise power) at respective sides of a person's head.

Least noise power of the left and the right beamformed, noise reduced signals, used as a selection criterion, combines a number of quality parameters into a simple computation. When the microphone inputs are aligned through alignment filters, noise power is a measure similar to signal-to-noise ratio, but it is simpler to compute.

When noise reduction is performed, there is a risk of introducing voice processing artifacts that degrade voice quality. The noise power measure, used in the least noise power criterion, selects for higher voice quality in many cases. When the criterion is based on least power, preference is associated with signals where it is easier to detect all parts of the voice component, especially the low-level parts, which in turn leads to fewer audible instances of voice processing artifacts. A voice activity detector may output a signal indicative of whether voice activity is detected or not. Voice activity may be detected when an amplitude or peak magnitude or power level of one or more microphone signals and/or a beamformed signal exceeds a predefined or time-varying threshold. The level of the threshold may be adapted to an estimated noise level.
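A simple power-based voice activity detector with a threshold adapted to an estimated noise level may be sketched as follows. The margin of 6 dB, the smoothing constant, the use of the first frame as the initial noise estimate and the function name are assumptions chosen for illustration only.

```python
def detect_voice(frame_powers, margin_db=6.0, smoothing=0.9):
    """Return per-frame voice flags using a threshold that tracks the noise level.

    frame_powers: sequence of non-negative frame power values.
    """
    if not frame_powers:
        return []
    factor = 10.0 ** (margin_db / 10.0)  # margin as a linear power ratio
    noise = frame_powers[0]  # assumption: the first frame contains noise only
    flags = []
    for power in frame_powers:
        voiced = power > factor * noise
        flags.append(voiced)
        if not voiced:
            # Update the noise estimate only while no voice is detected.
            noise = smoothing * noise + (1.0 - smoothing) * power
    return flags
```

The flags produced in this way may be used to restrict noise power estimation to time intervals when voice activity is detected as not present.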

In some embodiments the noise suppression coefficient is computed to reduce noise by a predetermined, fixed factor.

The predetermined factor may be e.g. 13 dB, 6 dB, 10 dB, 15 dB or another factor. This may be achieved by limiting the noise suppression gain to the predetermined factor.

As an example, an estimated noise level at the output of the first beamformer and the second beamformer may be, say, −30 dB and −20 dB, respectively; the fixed factor may be say 10 dB; and consequently, the estimated noise level after noise suppression is then −40 dB and −30 dB, respectively.
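The arithmetic of the example above, and the limiting of the noise suppression gain to a predetermined factor, may be sketched as follows. The function names are hypothetical, and the upper limit of unity gain is an assumption made for this sketch.

```python
def suppressed_level_db(noise_level_db: float, suppression_db: float) -> float:
    """Noise level in dB after applying a fixed suppression factor given in dB."""
    return noise_level_db - suppression_db


def limit_gain(gain: float, max_suppression_db: float) -> float:
    """Clamp a linear amplitude gain so that attenuation never exceeds
    max_suppression_db; the gain is also capped at 1.0 (assumption)."""
    floor = 10.0 ** (-max_suppression_db / 20.0)
    return min(1.0, max(gain, floor))
```

For instance, `suppressed_level_db(-30.0, 10.0)` and `suppressed_level_db(-20.0, 10.0)` correspond to the two beamformer outputs of the example.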

The left and right beamformed signals may be matched in level towards the signal of interest, e.g. using alignment filters/gains on the microphones at any point in the signal chain preceding the noise suppression gain selection module. As a beneficial consequence of using fixed noise suppression factors and level-matched left and right channels, noise power computations are conditioned to serve as left and right signal quality measures which reflect the signal-to-noise ratios of the left and right beamformer outputs to a higher degree.

In some embodiments at least one of the first beamformer or the second beamformer is configured to comprise: a first stage that generates a summation signal and a difference signal from the input signals, subject to at least one of the input signals being phase and/or amplitude aligned with another of the input signals with respect to a desired signal; and a second stage that filters the difference signal and generates a filtered signal; wherein the beamformed output signal is generated from the difference between the summation signal and the filtered signal; and wherein the filter is adapted using a least mean square technique to minimize the power of the beamformed output signal.

Thereby the first and/or the second beamformer selectively and adaptively cancel out sound from certain directions.
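The structure described above may be illustrated by the following time-domain sketch. It is not the implementation of WO 2009/132646; it assumes a short FIR filter adapted with a normalized variant of the least mean square technique, and the function name, tap count and step size are hypothetical. The inputs are assumed to be already aligned with respect to the desired signal, so that the desired signal cancels in the difference signal.

```python
import random


def sum_diff_beamformer(x1, x2, num_taps=4, mu=0.5, eps=1e-9):
    """Adaptive two-microphone beamformer built from a sum and a difference signal.

    x1, x2: equal-length sequences of real samples.
    Returns (output samples, final filter taps).
    """
    taps = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent difference samples, newest first
    output = []
    for a, b in zip(x1, x2):
        s = a + b  # summation signal
        d = a - b  # difference signal (an aligned desired signal cancels here)
        history = [d] + history[:-1]
        filtered = sum(h * v for h, v in zip(taps, history))
        y = s - filtered  # beamformed output
        norm = sum(v * v for v in history) + eps
        # Normalized LMS step that reduces the power of the output.
        taps = [h + mu * y * v / norm for h, v in zip(taps, history)]
        output.append(y)
    return output, taps


if __name__ == "__main__":
    random.seed(0)
    noise = [random.gauss(0.0, 1.0) for _ in range(2000)]
    # Noise reaching the second microphone at half the amplitude.
    out, taps = sum_diff_beamformer(noise, [0.5 * n for n in noise])
    print(taps)
```

In the demonstration, the noise is fully predictable from the difference signal, so the adapted filter removes it from the output while a signal that is identical in both inputs would pass through the summation path unaffected by the filter.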

The filter may have a low-pass characteristic to enhance lower frequency components relative to higher frequency components. The filter may be a bass-boost filter.

Such a beamformer may be configured as disclosed in WO 2009/132646 which is hereby incorporated by reference for everything it discloses.

In some embodiments the third beamformer is configured with a fixed sensitivity with respect to a predefined spatial position relative to the spatial position of the microphones.

A fixed sensitivity means that the third beamformer applies a fixed frequency response with respect to sound emanating from an acoustic source at the predefined spatial position.

The predefined position is located in a predefined way with respect to the spatial position and orientation of the first set of microphones and the second set of microphones. The predefined space is preferably centred about a person's mouth when the apparatus is worn by the person in a normal way.

Beamforming coefficients of the third beamformer may be constrained to sum to a fixed gain e.g. unity gain towards the spatial position. The gain is fixed in the sense that it is not adaptive. However, the gain may be adjusted in connection with calibration or as a preference setting.

The third beamformer may combine the input signals by a linear combination. Alternatively, the signals may be combined by a non-linear combination.

In some embodiments the microphones output digital signals; the apparatus performs a transformation of the digital signals to a time-frequency representation, in multiple frequency bands; and the apparatus performs an inverse transformation of at least the combined signal to a time-domain representation.

The transformation may be performed by means of a Fast Fourier Transformation, FFT, applied to a signal block of a predefined duration. The transformation may involve applying a Hann window or another type of window. A time-domain signal may be reconstructed from the time-frequency representation via an Inverse Fast Fourier Transformation, IFFT.

The signal block of a predefined duration may have duration of 8 ms with 50% overlap, which means that transformations, adaptation updates, noise reduction updates and time-domain signal reconstruction are computed every 4 ms. However, other durations and/or update intervals are possible. The digital signals may be one-bit signals at a many-times oversampled rate, two-bit or three-bit signals or 8 bit, 10 bit, 12 bit, 16 bit or 24 bit signals.
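The relation between block duration, overlap and update interval may be sketched as follows. The sample rate of 16 kHz used in the usage example is an assumption, as are the function names. The sketch also builds a periodic Hann window, for which copies shifted by half the block length sum to one, which is one reason a 50% overlap is convenient for reconstruction.

```python
import math


def frame_params(sample_rate_hz: int, block_ms: float = 8.0, overlap: float = 0.5):
    """Return (frame length, hop size) in samples for the given block duration."""
    frame_len = int(round(sample_rate_hz * block_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    return frame_len, hop


def periodic_hann(n: int):
    """Periodic Hann window of length n; copies shifted by n/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


if __name__ == "__main__":
    frame_len, hop = frame_params(16000)  # assumed 16 kHz sample rate
    print(frame_len, hop, 1000.0 * hop / 16000)
```

At the assumed sample rate, an 8 ms block with 50% overlap gives a hop that corresponds to the 4 ms update interval mentioned above.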

In alternative implementations/embodiments, all or parts of the system operate directly in the time-domain. For example, noise suppression may be applied to a time-domain signal by means of FIR or IIR filtering, with the noise suppression filter coefficients computed in the frequency domain.

In some embodiments the microphones output analogue signals; the apparatus performs analogue-to-digital conversion of the analogue signals to provide digital signals; the apparatus performs a transformation of the digital signals to a time-frequency representation, in multiple frequency bands; and the apparatus performs an inverse transformation of at least the combined signal to a time-domain representation.

In some embodiments the microphones of at least one pair of the set of microphones are arranged in an end-fire configuration oriented towards a position where a person's mouth is expected to be when the apparatus is used by the person. Such a configuration has been shown to give good noise cancelling and suppression, e.g., for headsets or hearing aids.

There is also provided a method for processing audio signals from multiple microphones, comprising: receiving a first pair and a second pair of microphone signals from a first pair of microphones and a second pair of microphones, respectively; wherein the first pair of microphones are arranged with a first mutual distance and the second pair of microphones are arranged with a second mutual distance, and wherein the first pair of microphones are arranged at a distance from the second pair of microphones that is greater than the first mutual distance and the second mutual distance at least when the apparatus is in normal operation; performing first beamforming and second beamforming on the first pair of microphone signals and the second pair of microphone signals to output respective beamformed signals; adapting the spatial sensitivity by a respective pair of microphones as measured in a respective beamformed signal such that spatial sensitivity is adapted to suppress noise relative to a desired signal; performing third beamforming to dynamically combine the signals output from the first beamforming and the second beamforming into a combined signal; wherein the signals are combined such that noise energy in the combined signal is minimized while a desired signal is preserved; and performing noise reduction to process the combined signal from the third beamformer and output the combined signal such that noise is reduced.

There is also provided a computer program product, e.g. stored on a computer-readable medium such as a DVD, comprising program code means adapted to cause a data processing system to perform the steps of the method, when said program code means are executed on the data processing system.

There is also provided a computer data signal, e.g. a download signal, embodied in a carrier wave and representing sequences of instructions which, when executed by a processor, cause the processor to perform the steps of the method.

Here and in the following, the terms ‘processing means’ and ‘processing unit’ are intended to comprise any circuit and/or device suitably adapted to perform the functions described herein. In particular, the above term comprises general purpose or proprietary programmable microprocessors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Programmable Logic Arrays (PLA), Field Programmable Gate Arrays (FPGA), special purpose electronic circuits, etc., or a combination thereof.

BRIEF DESCRIPTION OF THE FIGURES

The above and/or additional objects, features and advantages of the present invention will be further elucidated by the following illustrative and non-limiting detailed description of embodiments of the present invention, with reference to the appended drawings, wherein:

FIG. 1 shows a block diagram of a signal processor;

FIG. 2 shows a more detailed block diagram of the signal processor; and

FIG. 3 shows different configurations of an apparatus with multiple microphones.

DETAILED DESCRIPTION

In the following description, reference is made to the accompanying figures, which show, by way of illustration, how the invention may be practiced.

FIG. 1 shows a block diagram of a signal processor and a first and second pair of microphones. The first set of microphones, 101 and 102, and the second set of microphones, 103 and 104, are arranged with an intra-pair distance between the microphones that is relatively short compared to the inter-pair distance between the pairs of microphones. The signal processor is designated by reference numeral 100.

The first pair of microphones 101 and 102 outputs a first microphone signal pair, which is input to a first beamformer 105, and the second pair of microphones 103 and 104 outputs a second microphone signal pair, which is input to a second beamformer 106. The first beamformer 105 and the second beamformer 106 output respective output signals X_(L) and X_(R).

The first beamformer 105 and the second beamformer 106 are each configured to adapt their spatial sensitivity. The spatial sensitivity is adapted to cancel or suppress noise relative to a desired signal. The first beamformer and the second beamformer may be configured as disclosed in WO 2009/132646.

The third beamformer 107 is configured to dynamically combine the signals, X_(L); X_(R), output from the first beamformer 105 and the second beamformer 106 into a combined signal X_(C). The combined signal X_(C) can be expressed by the following expression:

$X_{C} = G_{L} X_{L} + G_{R} X_{R}$

Where G_(L) and G_(R) represent transfer functions from a first input at which X_(L) is received and from a second input at which X_(R) is received, respectively. The above expression relies on a frequency domain representation; X_(L) and X_(R) are complex numbers. An equivalent representation exists for a time-domain representation. The third beamformer is configured to adjust real or complex G_(L) and G_(R) dynamically to output X_(C) with a lowest noise level while preserving a desired signal.

The following expression is an example of how real G_(L), G_(R) may be computed:

$\hat{G}_{L} = \frac{\langle |X_{R}|^{2} \rangle - \mathrm{Re}\,\langle X_{L} X_{R}^{*} \rangle}{\langle |X_{L} - X_{R}|^{2} \rangle}$, $\hat{G}_{R} = 1 - \hat{G}_{L}$

where Re is the real part of a complex number, and $(\cdot)^{*}$, $\langle \cdot \rangle$ and $|\cdot|$ represent complex conjugate, averaging across a time interval and absolute value, respectively.

The above expressions for real Ĝ_(L) and Ĝ_(R) are solutions to a mean squares cost function subject to a constraint:

$\hat{G}_{L} = \arg\min_{G_{L}} \langle |X_{C}|^{2} \rangle$ subject to: Ĝ_(L) + Ĝ_(R) = 1

That is, the mean-squares of X_(C) are minimized as a function of real G_(L), subject to a constraint. The constraint ensures that the desired signal is favoured over signals from at least some other locations.
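The constrained minimization may be illustrated by the following sketch for a single frequency bin, in which the real weight for the left input is computed from time averages and the weight for the right input follows from the constraint that the two weights sum to one. The function names, the representation of the inputs as plain sequences of complex values and the small regularization constant `eps` are assumptions for illustration.

```python
def combine_weights(x_l, x_r, eps=1e-12):
    """Real weights (g_l, g_r), with g_l + g_r = 1, that minimize the
    mean of |g_l * x_l + g_r * x_r|^2 over the given values.

    x_l, x_r: equal-length, non-empty sequences of real or complex values.
    """
    n = len(x_l)
    p_r = sum(abs(b) ** 2 for b in x_r) / n  # <|X_R|^2>
    cross = sum((a * b.conjugate()).real
                for a, b in zip(x_l, x_r)) / n  # Re<X_L X_R*>
    p_diff = sum(abs(a - b) ** 2 for a, b in zip(x_l, x_r)) / n  # <|X_L - X_R|^2>
    g_l = (p_r - cross) / (p_diff + eps)
    return g_l, 1.0 - g_l


def combine(x_l, x_r, g_l, g_r):
    """Combined signal X_C = G_L * X_L + G_R * X_R, value by value."""
    return [g_l * a + g_r * b for a, b in zip(x_l, x_r)]
```

When both inputs contain the same desired component and mutually uncorrelated noise, the weights computed in this way favour the input with the lower noise power, while the desired component passes with unity gain because the weights sum to one.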

In some embodiments matching filters are inserted between the microphones and the inputs to the beamformers of the first stage, i.e., in the shown embodiment, the first and the second beamformer. The matching filters thereby filter the input signals to the first and the second beamformers so that the desired signal component is sufficiently identical in all the inputs, i.e., with respect to phase and amplitude. The filters compensate for variations in acoustic path of the desired signal to the microphones as well as variations in microphone sensitivities or other variations. Such matching filters may also be denoted alignment filters and matching may be denoted alignment. As a result of the input alignment with respect to the desired source, the desired signal components at the outputs of the first and second beamformers are similarly identical due to the inbuilt constraints (e.g. as described in WO 2009/132646). That is, the inputs to the third beamformer are sufficiently identical with respect to the desired signal component. As a consequence, the G_(L)+G_(R)=1 constraint leads to the output and inputs of the third beamformer being sufficiently identical with respect to the desired signal.

One of the inputs may be chosen as a reference for microphone alignment. For example, one of the alignment filters may be configured to produce an all-pass characteristic; the other alignment filters are configured accordingly. As a result, the outputs of each of the first stage beamformers with respect to the desired signal are sufficiently similar and also similar to the reference input.

The microphone alignment filters may be pre-configured by assuming and compensating for a known acoustical relation between the origin of the desired signal and the microphones and using microphones with very small variations in sensitivities. The microphone sensitivities may be estimated in a calibration step at the time of production. The microphone alignment filters may be estimated while the device is in operation: when activated by a voice or noise activity detector, the alignment filters are estimated by, e.g., a least squares technique.
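An in-operation estimate of an alignment filter by a least squares technique may, for a single frequency bin, be sketched as follows. The sketch assumes that the values passed in were collected from frames selected by the activity detector, that one input serves as the reference, and that a single complex gain per bin is sufficient; the function name and these simplifications are assumptions.

```python
def estimate_alignment(x_ref, x, eps=1e-12):
    """Least-squares complex gain h minimizing the mean of |x_ref - h * x|^2.

    x_ref: values of the reference input in one frequency bin.
    x: values of the input to be aligned, same length.
    """
    num = sum(r * v.conjugate() for r, v in zip(x_ref, x))
    den = sum(abs(v) ** 2 for v in x) + eps
    return num / den
```

Multiplying the input to be aligned by the estimated gain brings its desired signal component close to that of the reference input in phase and amplitude, in the least squares sense.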

Constraining the beamformer with respect to the desired signal may be equivalently achieved by integrating the microphone alignment filters directly into one or more of the beamformers' calculations, or, alternatively at the outputs of the first and second beamformers.

When the input signals (X_(L); X_(R)) are combined in this way, the input signal that exhibits the lowest noise level is emphasized over the other one.

The above expression for computing G_(L) and G_(R) is at least to some extent resistant to the influence of the desired signal and may work sufficiently well without any voice-activity detector, VAD.

The below expression is an alternative and is somewhat less resource demanding to compute, but is advantageously used in combination with a voice-activity detector, VAD:

$\tilde{G}_{L} = \frac{\langle |X_{R}|^{2} \rangle}{\langle |X_{R}|^{2} \rangle + \langle |X_{L}|^{2} \rangle}$ $\tilde{G}_{R} = 1 - \tilde{G}_{L}$

Where X_(R) and X_(L) are complex representations of the respective signals. This expression is subject to a similar minimization and constraint as mentioned above but assumes that the noise components in X_(R) and X_(L) are uncorrelated. In this case, the voice-activity detector is applied to discard, for the purpose of estimating G_(L) and G_(R), those signal portions of X_(R) and X_(L) in which voice is present. Such a weighting rule was disclosed in U.S. Pat. No. 7,206,421 B1 for a multi-microphone input.

For more robust performance, G_(L) and G_(R) may be constrained further to an interval, say, between 0 and 1.
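The alternative weighting rule may be sketched as follows for a single frequency band. The sketch assumes that the averages are taken over frames flagged as noise-only by a voice-activity detector, obtains the right weight from the G_(L)+G_(R)=1 constraint, and limits the weights to the interval between 0 and 1. The function name, the equal-weight fallback and the example data are illustrative assumptions.

```python
def combining_weights(x_left, x_right, voice_active):
    """Weights (g_left, g_right) from noise-only mean-square values of the
    two beamformed signals, with g_left + g_right = 1."""
    p_left = 0.0
    p_right = 0.0
    for xl, xr, voice in zip(x_left, x_right, voice_active):
        if not voice:  # discard frames in which voice is present
            p_left += abs(xl) ** 2
            p_right += abs(xr) ** 2
    total = p_left + p_right
    if total == 0.0:
        return 0.5, 0.5  # assumption: equal weights when nothing is known
    g_left = min(1.0, max(0.0, p_right / total))  # limit to [0, 1]
    return g_left, 1.0 - g_left


# Hypothetical frames: the right signal carries three times the noise amplitude.
x_l = [1 + 0j, 10 + 0j, 1j]
x_r = [3 + 0j, 10 + 0j, 3j]
g_l, g_r = combining_weights(x_l, x_r, [False, True, False])
```

In this example the left signal, which has the lower noise power, receives the larger weight.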

In general, it should be noted that the estimated position of the source emitting the desired signal may be pre-configured and locked to an expected position relative to the positions of the microphones. This could be the case for a headset, wherein the position of a person's mouth may be sufficiently well-defined when the headset is worn in a normal position. In other cases, the apparatus may comprise a tracker that estimates the position of the source of the desired signal from, e.g., phase and/or amplitude differences in the signals from one, two or more microphone pairs or sets of more than two microphones. This could be the case for a speakerphone or a hands-free set for a communications device in, e.g., a car.

The combined signal, X_(C), is input to a noise suppression unit 109 that computes a noise suppression gain, A_(S), from the beamformed signals X_(L) and X_(R). Additionally, the noise suppression unit 109 may include the microphone signals from one or more of the microphones 101, 102, 103, 104 in computing the noise suppression gain, A_(S). The signals from M3 and M4 and the signal X_(R) output from the beamformer 106 are labelled ‘a’, ‘b’ and ‘c’ and are input to the noise suppression unit 109 as indicated by respective labels.

Computation of the noise suppression gain, A_(S), is described further below.

In the shown embodiment, the noise suppression gain, A_(S), is applied to the combined signal, X_(C), by a multiplier 108. A signal output from the multiplier is a reproduced audio signal comprising beamformed and noise suppressed signal components picked up by the microphones. Label ‘O’ designates output from the signal processor. The output may be subject to further signal processing, amplification and/or transmission.
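The signal flow through the third beamformer and the multiplier 108 may be sketched for one frequency band as below. The sketch assumes that the third beamformer forms a weighted sum of X_(L) and X_(R) with weights that sum to 1; the function name and the numeric values are hypothetical.

```python
def process_band(x_left, x_right, g_left, suppression_gain):
    """Combine two beamformed values with weights summing to 1 (assumed form
    of the third beamformer), then apply the noise suppression gain."""
    g_right = 1.0 - g_left
    x_combined = g_left * x_left + g_right * x_right
    return suppression_gain * x_combined


# Hypothetical values: an identical desired component plus different noise.
desired = 1 + 2j
out_unity = process_band(desired + 0.1, desired - 0.3j, 0.75, 1.0)
out_half = process_band(desired + 0.1, desired - 0.3j, 0.75, 0.5)
```

Because the weights sum to 1, the desired component, being identical in both inputs, passes unchanged to the combined signal, while the noise components are scaled by their respective weights.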

FIG. 2 shows a more detailed block diagram of the signal processor. It is shown that the noise suppression gain, A_(S), is selected as either a first or left noise suppression gain, A_(L), or a second or right noise suppression gain, A_(R). The left noise suppression gain, A_(L), is computed from the beamformed signal X_(L) and/or the microphone signals xm₁ and/or xm₂. Correspondingly, the right noise suppression gain, A_(R), is computed from the beamformed signal X_(R) and/or the microphone signals xm₃ and/or xm₄.

A_(L) is applied to X_(L) via multiplier 205 and A_(R) is applied to X_(R) via multiplier 209. Respective outputs of the multipliers 205 and 209 are input to respective signal quality evaluators 203 and 208. The inputs may be interpreted as left and right noise-reduced, beamformed signals.

The signal quality evaluators 203 and 208 may evaluate the signal quality of the signals output from the multipliers 205 and 209 according to a criterion of signal-to-noise ratio. Alternatively, signal quality may be evaluated according to a criterion of noise signal power during a time interval when voice activity is detected as not present. This may be facilitated by applying the microphone alignment filters to render the desired signal component sufficiently identical at all beamformer inputs and outputs. In this case, signal-to-noise ratio and noise power are similar measures of signal quality. The signal quality evaluators output signals P_(L) and P_(R) that control a selector 204 to select either A_(L) or A_(R). A_(S), which is output from the selector, represents the selected noise suppression gain and is applied to X_(C) via a multiplier 108.

Signals P_(L) and P_(R) and hence the signal quality evaluators 203 and 208 may be defined as power computations on the noise component of the signals received as inputs. For example, P_(L) may be defined as the mean square of the beamformed, noise-reduced input during noise-only intervals. Averaging may be performed across a suitable time interval, e.g., 100 ms or 1 s, and across a suitable frequency interval, e.g. 0-8000 Hz.

The selector 204 may be configured to select A_(L) when P_(L) is less than P_(R) and conversely select A_(R) when P_(L) is larger than P_(R). Voice activity detectors 202 and 207 output signals to the signal quality evaluators 203 and 208, respectively, indicative of whether voice is detected.
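The evaluation and selection described above may be sketched as follows. The sketch assumes that P_(L) and P_(R) are mean-square values of the noise-reduced, beamformed signals over noise-only frames. The handling of equal powers and of an interval without noise-only frames is not specified above and is chosen arbitrarily here; all names are hypothetical.

```python
def noise_power(frames, voice_active):
    """Mean square of the noise-reduced frames over noise-only frames."""
    total = 0.0
    count = 0
    for x, voice in zip(frames, voice_active):
        if not voice:
            total += abs(x) ** 2
            count += 1
    if count == 0:
        return float("inf")  # assumption: no estimate means not preferred
    return total / count


def select_gain(a_left, a_right, p_left, p_right):
    """Select the gain of the branch with the lower noise power."""
    # Assumption: ties fall to the right branch.
    return a_left if p_left < p_right else a_right


# Hypothetical frames; the second frame contains voice and is skipped.
voice = [False, True, False]
p_l = noise_power([0.1 + 0j, 2.0 + 0j, 0.1j], voice)
p_r = noise_power([0.3 + 0j, 2.0 + 0j, 0.3 + 0j], voice)
a_s = select_gain(0.4, 0.7, p_l, p_r)
```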

A voice activity detector, VAD, of a single-input type, may be configured to estimate a noise floor level, N, by receiving an input signal and computing a slowly varying average of the magnitude of the input signal. A comparator may output a signal indicative of the presence of a voice signal when the magnitude of the signal temporarily exceeds the estimated noise floor by a predefined factor of, say, 10 dB. The VAD may disable noise floor estimation when the presence of voice is detected. Such a voice detector works when the noise is quasi-stationary and when the magnitude of voice exceeds the estimated noise floor sufficiently. Such a voice activity detector may operate on a band-limited signal or on multiple frequency bands to generate a voice activity signal aggregated from multiple frequency bands. When the voice activity detector operates on multiple frequency bands, it may output multiple voice activity signals for respective multiple frequency bands.
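A single-input voice activity detector of the kind described may be sketched as below. The sketch assumes a first-order recursive average as the slowly varying average, interprets the 10 dB margin as a magnitude ratio (20·log10), and initializes the noise floor from the first frame. The smoothing constant and the function name are illustrative assumptions.

```python
def run_vad(magnitudes, threshold_db=10.0, smoothing=0.95):
    """Return per-frame voice decisions and the final noise floor estimate."""
    if not magnitudes:
        return [], 0.0
    factor = 10.0 ** (threshold_db / 20.0)  # dB margin as a magnitude ratio
    floor = magnitudes[0]  # assumption: first frame contains noise only
    decisions = []
    for m in magnitudes:
        voice = m > factor * floor
        if not voice:
            # Update the noise floor only while no voice is detected.
            floor = smoothing * floor + (1.0 - smoothing) * m
        decisions.append(voice)
    return decisions, floor


# Hypothetical input: steady noise with a five-frame burst ten times larger.
mags = [1.0] * 20 + [10.0] * 5 + [1.0] * 5
decisions, floor = run_vad(mags)
```

Freezing the noise floor during detected voice keeps the burst from raising the estimate, which corresponds to disabling noise floor estimation when voice is present.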

A voice activity detector, VAD, of a multiple-input type, may be configured to compute a signal indicative of coherence between multiple signals. For example, the voice signal may exhibit a higher level of coherence between the microphones due to the mouth being closer to the microphones than the noise sources. Other types of voice activity detectors are based on computing spatial features or cues such as directionality and proximity, or on dictionary approaches that decompose the signal into codebook time/frequency profiles.
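A coherence-based decision may be sketched as follows for one frequency band. The sketch uses the magnitude-squared coherence computed over a block of frames from two microphones; the choice of this particular coherence measure, the threshold value and the names are assumptions made for illustration.

```python
def magnitude_squared_coherence(x1, x2):
    """|<x1 x2*>|^2 / (<|x1|^2> <|x2|^2>) over a block of complex frames."""
    cross = sum(a * b.conjugate() for a, b in zip(x1, x2))
    p1 = sum(abs(a) ** 2 for a in x1)
    p2 = sum(abs(b) ** 2 for b in x2)
    if p1 == 0 or p2 == 0:
        return 0.0
    return abs(cross) ** 2 / (p1 * p2)


def coherence_vad(x1, x2, threshold=0.7):
    """Flag voice when the block coherence exceeds a threshold (assumed)."""
    return magnitude_squared_coherence(x1, x2) > threshold


# Hypothetical blocks: a scaled, phase-shifted copy versus an unrelated signal.
block = [1 + 0j, 1j, -1 + 0j, -1j]
coherent = [0.5j * a for a in block]
unrelated = [1 + 0j, -1 + 0j, 1 + 0j, -1 + 0j]
```

A signal that reaches both microphones through a fixed gain and phase relation yields a coherence near 1, whereas unrelated signals yield a coherence near 0.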

A noise suppression gain designated G_(NS) or A_(L) or A_(R) may be computed from the following expression:

$G_{NS} = \frac{|X|^{2}}{|X|^{2} + P_{N} F}$

Wherein P_(N) is the square of the estimated noise floor level at a time instance t; |X|² is the square of the input signal at the time instance t; and F is a factor, e.g., a factor of 10. The noise suppression gain affects an input signal via a multiplier, if applied in a frequency domain.

Thus, on the one hand, if the noise floor level is very low, G_(NS) approaches 1 when voice is significantly present. On the other hand, if voice is absent or the noise level rises, G_(NS) moves to values less than 1, and the input signal is consequently suppressed. The factor F is selected to set how aggressively the input signal should be suppressed.
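The behaviour of the noise suppression gain may be illustrated by a short sketch for a single frequency band. The function name, the handling of an all-zero input and the numeric values are assumptions for illustration only.

```python
def noise_suppression_gain(x, noise_floor, factor=10.0):
    """G_NS = |X|^2 / (|X|^2 + P_N * F), where P_N is the squared noise floor."""
    power = abs(x) ** 2
    p_n = noise_floor ** 2
    denom = power + p_n * factor
    if denom == 0.0:
        return 1.0  # assumption: leave an all-zero band untouched
    return power / denom


# Hypothetical values with a noise floor level of 1 and F = 10.
g_voice = noise_suppression_gain(10 + 0j, 1.0)  # input well above the floor
g_noise = noise_suppression_gain(1 + 0j, 1.0)   # input at the floor
out = g_noise * (1 + 0j)                        # gain applied via a multiplier
```

With these values the gain is 100/110 when the input is well above the noise floor and 1/11 when the input is at the noise floor, so a larger factor F suppresses noise-level input more strongly.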

In respect of the above description of a voice-activity detector and noise suppression gain, its input signal(s) may be any of the microphone signals and/or output from the first beamformer and/or second beamformer and/or third beamformer.

In general, a way to estimate the signal and noise relation is based on tracking the noise floor, wherein voice or noisy voice is identified by signal parts significantly exceeding the noise floor level. Noise levels may, e.g., be estimated by minimum statistics as in [R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics,” Trans. on Speech and Audio Processing, Vol. 9, No. 5, July 2001], where the minimum signal level is adaptively estimated.

Other ways to identify signal and noise parts are based on computing multi-microphone/spatial features such as directionality and proximity [O. Yilmaz and S. Rickard, “Blind Separation of Speech Mixtures via Time-Frequency Masking”, IEEE Transactions on Signal Processing, Vol. 52, No. 7, pages 1830-1847, July 2004] or coherence [K. Simmer et al., “Post-filtering techniques.” Microphone Arrays. Springer Berlin Heidelberg, 2001. 39-60]. Dictionary approaches decomposing signal into codebook time/frequency profiles may also be applied [M. Schmidt and R. Olsson: “Single-channel speech separation using sparse non-negative matrix factorization,” Interspeech, 2006].

In general, noise suppression may be implemented as described in [Y. Ephraim and D. Malah, “Speech enhancement using optimal non-linear spectral amplitude estimation,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, 1983, pp. 1118-1121] or as described elsewhere in the literature on noise suppression techniques. Typically, a time-varying filter is applied to the signal. Analysis and/or filtering are often implemented in a frequency transformed domain/filter bank, representing the signal in a number of frequency bands. At each represented frequency, a time-varying gain is computed depending on the relation between the estimated desired signal and noise components. For example, when the estimated signal-to-noise ratio exceeds a pre-determined, adaptive or fixed threshold, the gain is steered toward 1. Conversely, when the estimated signal-to-noise ratio does not exceed the threshold, the gain is set to a value smaller than 1. The labels designated ‘x’ and ‘y’ connect the respective signals: x-to-x and y-to-y.

FIG. 3 shows different configurations of an apparatus with multiple microphones. On the left-hand side, a spectacle frame 303 with bows 306 is configured with two sets of microphones 304 and 305. On the right-hand side, a flexible neckband 307 is configured with two sets of microphones 308 and 309. Reference numeral 301 designates the head of a person wearing the spectacle frame 303 and reference numeral 302 designates the head of a person wearing the neckband 307.

The microphones may be arranged in a so-called end-fire configuration wherein the microphones of a respective pair or set of microphones sit on a line that intersects with or passes close to a position of a source of a desired signal. The position may be a position of the person's mouth opening or a position in proximity of the person's mouth opening. In an end-fire configuration the microphones of a microphone pair sit on a straight line intersecting the position of the source of the desired signal. Such a configuration is found to be suitable for effectively suppressing or cancelling noise from sources located elsewhere when the apparatus is a headset, hearing aid or the like.

In alternative configurations, a so-called broadside configuration for the microphone positions is used. In a broadside configuration the microphones of a microphone pair sit on a straight line at an equal distance to the position of the source of the desired signal.

In still alternative configurations, the microphones of a microphone pair sit on a line inclined e.g. at 5°, 10°, 45° relative to a direction from the microphone pair to the position of the source of the desired signal, thereby providing a configuration that may be more practically suitable.
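The effect of the microphone orientation on the desired signal may be illustrated by the arrival-time difference between the two microphones of a pair. The sketch below assumes a far-field (plane wave) approximation, which is a simplification when the source is close to the microphones as in a headset, and a hypothetical spacing of 10 mm; the function name and values are illustrative.

```python
import math


def inter_mic_delay(spacing_m, angle_deg, speed_of_sound=343.0):
    """Far-field arrival-time difference in seconds between two microphones.

    angle_deg is the angle between the microphone axis and the direction to
    the source: 0 corresponds to end-fire and 90 to broadside.
    """
    return spacing_m * math.cos(math.radians(angle_deg)) / speed_of_sound


# Hypothetical 10 mm spacing evaluated for three orientations.
delay_endfire = inter_mic_delay(0.01, 0.0)
delay_inclined = inter_mic_delay(0.01, 45.0)
delay_broadside = inter_mic_delay(0.01, 90.0)
```

Under this approximation, the end-fire configuration yields the largest delay for the desired signal, the broadside configuration yields essentially none, and an inclined configuration lies in between.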

Generally, in the above it is assumed that so-called digital microphones outputting digital signals are used. However, analogue microphones in conjunction with an analogue-to-digital converter or any other transduction from the sound field to a sampled domain could be used. The microphones are typically embodied in so-called capsules with a diameter in the range of typically 3 mm to 5 mm or 6 mm.

In general, a beamformer may receive signals from more than a pair of microphones. A beamformer, e.g., a first stage beamformer, may receive microphone signals from 3, 4 or more microphones. The first stage may comprise more than the first and the second beamformer; the first stage may comprise, e.g., 3, 4 or more beamformers. 

1. An apparatus, such as a headset, configured to process audio signals from multiple microphones, comprising: a first pair of microphones outputting a first pair of microphone signals and a second pair of microphones outputting a second pair of microphone signals; wherein the first pair of microphones are arranged with a first mutual distance and the second pair of microphones are arranged with a second mutual distance, and wherein the first pair of microphones are arranged at a distance from the second pair of microphones that is greater than the first mutual distance and the second mutual distance at least when the apparatus is in normal operation; a first beamformer and a second beamformer each configured to receive a pair of microphone signals and adapt the spatial sensitivity of a respective pair of microphones as measured in a respective beamformed signal (X_(L); X_(R)) output from a respective beamformer; wherein the spatial sensitivity is adapted to suppress noise relative to a desired signal; a third beamformer configured to dynamically combine the signals (X_(L); X_(R)) output from the first beamformer and the second beamformer into a combined signal (X_(C)); wherein the signals are combined such that noise energy in the combined signal is minimized while a desired signal is preserved; a noise reduction unit configured to process the combined signal (X_(C)) from the third beamformer and output the combined signal such that noise is reduced.
 2. An apparatus according to claim 1, wherein the noise reduction unit is configured to perform noise suppression on the combined signal (X_(C)) from the third beamformer in response to a noise suppression coefficient (A_(L); A_(R)); and wherein the noise suppression coefficient (A_(L); A_(R)) is estimated from the microphone signals and/or a beamformed signal (X_(L); X_(R)).
 3. An apparatus according to claim 1, wherein the apparatus comprises: a first control branch synthesizing a first noise suppression gain, A_(L), from the first pair of microphone signals and/or the first beamformer; a second control branch synthesizing a second noise suppression gain, A_(R), from the second pair of microphone signals and/or the second beamformer; a selector configured to dynamically select and/or output the first noise suppression gain, A_(L), or the second noise suppression gain, A_(R); wherein the noise reduction unit is configured to process the combined signal from the third beamformer in response to the selected and/or output noise suppression gain, A_(S), from the selector.
 4. An apparatus according to claim 3, wherein the selector is configured to operate in response to a first signal quality indicator (P_(L)) and a second signal quality indicator (P_(R)); and wherein the signal quality indicators (P_(L); P_(R)) are synthesized from a respective beamformed signal (X_(L); X_(R)) processed to reduce noise in response to respective noise reduction gains (A_(L); A_(R)).
 5. An apparatus according to claim 3, wherein a beamformed signal (X_(L); X_(R)), processed to reduce noise in response to respective noise reduction gains (A_(L); A_(R)), is input to an evaluator that is configured to output a control signal (P_(L); P_(R)) to the selector and thereby control selection; and wherein the evaluator evaluates the beamformed signal (X_(L); X_(R)), processed to reduce noise in response to respective noise reduction gains (A_(L); A_(R)), according to a criterion of least power during a time interval when voice activity is detected as not present.
 6. An apparatus according to claim 2, wherein the noise suppression coefficient is computed to reduce noise by a predetermined, fixed factor.
 7. An apparatus according to claim 1, wherein at least one of the first beamformer or second beamformer is configured to comprise: a first stage that generates a summation signal and a difference signal from the input signals, subject to at least one of the input signals being phase and/or amplitude aligned with another of the input signals with respect to a desired signal; and a second stage that filters the difference signal and generates a filtered signal; wherein the beamformed output signal is generated from the difference between the summation signal and the filtered signal; and wherein the filter is adapted using a least mean square technique to minimize the power of the beamformed output signal.
 8. An apparatus according to claim 1, wherein the third beamformer is configured with a fixed sensitivity with respect to a predefined spatial position relative to the spatial position of the microphones.
 9. An apparatus according to claim 1, wherein the microphones output digital signals; wherein the apparatus performs a transformation of the digital signals to a time-frequency representation, in multiple frequency bands; and wherein the apparatus performs an inverse transformation of at least the combined signal to a time-domain representation.
 10. An apparatus according to claim 1, wherein the microphones output analogue signals; wherein the apparatus performs analogue-to-digital conversion of the analogue signals to provide digital signals; wherein the apparatus performs a transformation of the digital signals to a time-frequency representation, in multiple frequency bands; and wherein the apparatus performs an inverse transformation of at least the combined signal to a time-domain representation.
 11. An apparatus according to claim 1, wherein the microphones of at least one pair of the set of microphones is arranged in an end-fire configuration oriented towards a position where a person's mouth is expected to be when the apparatus is used by the person.
 12. A method for processing audio signals from multiple microphones, comprising: receiving a first pair and a second pair of microphone signals from a first pair of microphones and a second pair of microphones, respectively; wherein the first pair of microphones are arranged with a first mutual distance and the second pair of microphones are arranged with a second mutual distance, and wherein the first pair of microphones are arranged at a distance from the second pair of microphones that is greater than the first mutual distance and the second mutual distance at least when the apparatus is in normal operation; performing first beamforming and second beamforming on the first pair of microphone signals and the second pair of microphone signals to output respective beamformed signals (X_(L); X_(R)); adapting the spatial sensitivity by a respective pair of microphones as measured in a respective beamformed signal (X_(L); X_(R)) such that spatial sensitivity is adapted to suppress noise relative to a desired signal; performing third beamforming to dynamically combine the signals (X_(L); X_(R)) output from the first beamforming and the second beamforming into a combined signal (X_(C)); wherein the signals are combined such that noise energy in the combined signal is minimized while a desired signal is preserved; performing noise reduction to process the combined signal (X_(C)) from the third beamformer and output the combined signal such that noise is reduced.
 13. A computer program product comprising program code means adapted to cause a data processing system to perform the steps of the method according to claim 12, when said program code means are executed on the data processing system.
 14. A computer program product according to claim 13, comprising a computer-readable medium having stored thereon the program code means.
 15. A computer data signal embodied in a carrier wave and representing sequences of instructions which, when executed by a processor, cause the processor to perform the steps of the method according to claim 12.